gravicom

Andrea Kaplan
March 27, 2014

A web-based tool for community detection in networks

Outline

  • Introduction/Background
  • User Interface
  • Demo
  • Graphical Devices
  • Technical Aspects
  • Further Work

Introduction

Who cares and why did we make this?

Networks

  • Many relationships easily conceptualized as a graph/network
  • A graph is defined as a collection of nodes (entities) and edges (relationships)
  • Examples of such relationships include:
    • social networks (sociology)
    • the world wide web (computer science)
    • protein networks (biology)

Network Examples

Social network of a karate club. Internet map 1024 - transparent The protein interaction network of Treponema pallidum

Community Detection

  • A community is defined as a group of nodes in a graph that share properties
  • Community structure - collection of nodes which are densely linked to nodes within the community and sparsely linked to notes outside
  • Current methodology for community detection often involves an algorithmic approach; partitions a graph into node clusters iteratively before stopping criterion
  • First define an objective function and then optimize

Modularity

  • Example objective function: Girvan & Newman's modularity measure

\[ Q = \sum\limits_r (e_{rr} - a_r^2) \] \( r = \) a community
\( e_{rr} = \) fraction of links that connect two nodes inside the community
\( a_r = \) the fraction of links that have one or both vertices inside the community

Modularity (Cont'd)

  • Interpretation: fraction of edges that connect vertices of the same type minus the expected value of the same quantity in a network with the same community divisions but random edges

\( Q = 0 \): number of within-community edges no better than random
\( Q \in [0.3, 0.7] \): strong community structure
\( Q=1 \): maximum value

Other Objective Functions

  • Modularity is not the only useful objective function
  • Simple proportion: quantification of density within the community vs. sparcity outside

\[ \pi(S) = \frac{\#\text{ of edges within } S}{\#\text{ of edges leaving} S} \]

  • Similar idea: Conductance

\[ \phi(S) = \frac{\#\text{ of edges leaving} S}{\sum\limits_{u \in S} \text{degree of } u} \]

Example Objective Functions

Site Components

\[ \begin{matrix} e_{AA} = \frac{3}{10} = 0.3 & e_{BB} = \frac{5}{10} = 0.5 \\ a_{A}^2 = \left(\frac{5}{10}\right)^2 = 0.25 & a_{B}^2 = \left(\frac{7}{10}\right)^2 = 0.49 \\ e_{AA} - a_A^2 = 0.05 & e_{BB} - a_B^2 = 0.01 \\ ~\\ Q = 0.06 & \end{matrix} \]
\[ \begin{matrix} \pi(A) = \frac{3}{2} = 1.5 & \pi(B) = \frac{5}{2} = 2.5 \end{matrix} \]
\[ \begin{matrix} \phi(A) = \frac{2}{8} = 0.25 & \phi(B) = \frac{2}{12} = 0.167 \end{matrix} \]

The Problem

  • Search for an optimal modularity value is NP-hard
  • The number of possible partitions of the network requires \( 2^n \) complexity
  • In fact, the optimization of an objective function is typically an NP-hard problem

Current Solutions

  • Heuristics used to optimize the objective function in a reasonable amount of time
  • Heuristic-based clustering is useful because this offers an automated way to perform community detection

The main elements of the problem themselves [graph clustering], i.e. the concepts of community and partition, are not rigorously defined, and require some degree of arbitrariness and/or common sense. (Fortunato, 2010)

  • Heuristics are not the solution.

Leverage the Human Visual System

  • Communities are often fuzzily-defined human concepts
  • Address this by adding element of human judgement
  • Introduce a novel visualization-based community detection tool, gravicom

gravicom Functionality

  • Allows users to
    • Visually direct and interact with the steps of community detection
    • Assess the created clusters through visual and quantitative summaries
  • Standalone exploratory tool for graph data
  • Initial state to be passed to a community detection algorithm in order reduce the complexity of optimization

User Interface

Meet gravicom

Components

Site Components (1) Control Panel, (2) Data Management, (3) Connection Table, (4) Graph Display, and (5) Tabset

Demo

http://shiny1.iastate.edu/andeek/gravicom
(must be VPNed to internal ISU network)

Football Example

Site Components

Football Example (Cont'd)

Conference Teams Identified Proportion Accuracy
SEC Vanderbilt, Florida, Louisiana State, South Carolina, Mississippi, Arkansas, Auburn, Kentucky, Georgia, Mississippi State, Alabama, Tennessee 1.50 100%
MAC Central Florida, Western Michigan, Miami Ohio, Ohio, Bowling Green State, Marshall, Ball State, Akron, Buffalo, Northern Illinois, Eastern Michigan, Toledo, Central Michigan, Kent 1.46 92.9%
Big 12 Kansas State, Iowa State, Kansas, Texas A& M, Texas Tech, Baylor, Missouri, Texas, Oklahoma State, Colorado, Oklahoma, Nebraska 1.44 100%
ACC Duke, Wake Forest, Virginia, Florida State, Clemson, North Carolina, Maryland, Georgia Tech, North Carolina State 1.44 100%
Pac-10 Arizona, Oregon State, Washington, Washington State, Arizona State, UC LA, Stanford, Southern California, Oregon, California 1.33 100%
Big 10 Ohio State, Penn State, Michigan, Michigan State, Purdue, Minnesota, Northwestern, Illinois, Iowa, Wisconsin, Indiana 1.22 100%

Football Example (Cont'd)

Conference Teams Identified Proportion Accuracy
WAC Nevada, Fresno State, Texas Christian, Tulsa, Hawaii, Rice, Southern Methodist, San Jose State, Texas El Paso 1.20 88.9%
Mountain West Brigham Young, San Diego State, Boise State, Wyoming, New Mexico, Nevada Las Vegas, Utah, North Texas, Utah State, New Mexico State, Colorado State, Arkansas State, Idaho, Air Force 0.96 57.1%
C-USA Cincinnati, Louisville, Houston, Tulane, Southern Mississippi, Army, Memphis, East Carolina, Alabama Birmingham 0.91 100%
Big East Boston College, Miami Florida, Virginia Tech, Syracuse, Temple, West Virginia, Connecticut, Pittsburgh, Rutgers 0.83 88.9%
Big West Middle Tennessee State, Louisiana Lafayette, Louisiana Monroe, Louisiana Tech 0.26 75%
Independent Notre Dame, Notre Dame, Navy, Navy 0.00 100%
  • Through manual specification of conferences, we were able to correctly classify 91.3 % of the football teams into their conferences.

Political Books Example

  • Political books co-purchased close to the 2004 United States presidential election and sold on Amazon.com Political Books

Political Books Example (Cont'd)

Political Books Grouped

Poltical Books Example (Cont'd)

Classification Books Identified Proportion Accuracy
Conservative A National Party No More, Dereliction of Duty, Ten Minutes from Normal, Bush Country, Rumsfeld’s War, Legacy, Hating America, Hillary’s Scheme, Meant To Be, Tales from the Left Coast, Breakdown, Losing Bin Laden, The French Betrayal of America, Spin Sisters, The Right Man, Useful Idiots, Shut Up and Sing, Who’s Looking Out for You?, Those Who Trespass, Bias, The O'Reilly Factor, Let Freedom Ring, Deliver Us from Evil, Give Me a Break, Betrayal, The Real America, The Faith of George W Bush, The Death of Right and Wrong, Power Plays, Arrogance, The Perfect Wife, The Bushes, Things Worth Fighting For, Off with Their Heads, Persecution, Why Courage Matters, Hollywood Interrupted, The Enemy Within, We Will Prevail, Endgame, The Official Handbook Vast Right Wing Conspiracy, The Third Terrorist, Slander, The Savage Nation, Fighting Back 8.04 93.3%

Political Books Example (Cont'd)

Classification Books Identified Proportion Accuracy
Liberal Downsize This!, The Culture of Fear, House of Bush, House of Saud, The Best Democracy Money Can Buy, Rogue Nation, Stupid White Men, Rush Limbaugh Is a Big Fat Idiot, The Great Unraveling, Against All Enemies, American Dynasty, The Price of Loyalty, The Sorrows of Empire, Worse Than Watergate, Plan of Attack, Big Lies, The Lies of George W. Bush, Bushwomen, The Bubble of American Supremacy, Living History, The Politics of Truth, Fanatics and Fools, Bushwhacked, Disarming Iraq, Lies and the Lying Liars Who Tell Them, MoveOn’s 50 Ways to Love Your Country, The Buying of the President 2004, Perfectly Legal, Bush at War, The New Pearl Harbor, Freethinkers, Had Enough?, It’s Still the Economy, Stupid!, We’re Right They’re Wrong, What Liberal Media?, The Clinton Wars, Weapons of Mass Deception, Dude, Where’s My Country?, Thieves in High Places, Shrub, Buck Up Suck Up, Hegemony or Survival, The Exception to the Rulers 7.42 95.2%

Political Books Example (Cont'd)

Classification Books Identified Proportion Accuracy
Neutral 1000 Years for Revenge, Bush vs. the Beltway, Charlie Wilson’s War, Dangerous Dimplomacy, Sleeping With the Devil, The Man Who Warned America, Why America Slept, Ghost Wars, Surprise, Security, the American Experience, Allies, The Choice, All the Shah’s Men, Soft Power, Colossus, The Future of Freedom, Rise of the Vulcans, America Unbound, Empire 1.06 50%
  • Correctly classified 86.67 % of the books into the categories created by the author of the data set.

Graphical Devices

Theory behind the curtain

Importance of Graph Layout

  • Graph layout is an assignment of a Cartesian coordinate to each node for display in 2D or 3D
  • Layout of a graph significantly affects the number of communities that users detect within a graph
  • Humans used to detect communities \( \Rightarrow \) special attention needs to be paid to the layout being used
  • Location of a node spatially relative to other nodes in a cluster has a significant effect on user: “adjacent nodes must be placed near to each other if possible” (McGrath, Blythe, and Krackhardt, 1996)

Force-Directed Layout

  • Physics based algorithm for graph drawing
  • Repulsive forces (charged particles) used to separate all pairs of nodes
  • Links are fixed-distance geometric constraints
  • Groups of nodes sharing multiple edges are pulled in closer proximity
  • Pseudo-gravity force keeps nodes centered in the visible area and avoids expulsion of disconnected subgraphs

Graph Simplification

  • Difficult to glean meaning from a visual representation in large/complex graphs
  • Replace repeated patterns in a graph by a representation, to simplify a network
  • Fewer nodes and edges to display \( \Rightarrow \) visual complexity of the graph visualization is greatly reduced
  • Allows the user to analyze the network structure more accurately

Technical Aspects

What makes it tick?

Page Lifecycle

Integrated Algorithm

Tools

  • Shiny: Server-client interaction
  • D3: User interface and graph layout
  • igraph/rjson: Data formatting

Tools (Cont'd)

Integrated Algorithm

Data Formatting

GML file structure:

graph
[
  directed 0
  node
  [
    id 0
    label "Node 1"
    value 100
  ]
  node
  [
    id 1
    label "Node 2"
    value 200
  ]
  edge
  [
    source 1
    target 0
  ]
]

JSON file structure:

{
  "nodes":
  [{"id":"n0","v_id":"0","v_label":"Node 1","v_value":"100"}, 
   {"id":"n1","v_id":"1","v_label":"Node 2","v_value":"200"}], 
 "edges":
  [{"source":0, "target":1}]
}

Further Work

Possible extensions to gravicom

Integrated Algorithmic Community Detection

  • Combine the benefits of human detection of communities with algorithmic detection
  • Visual detection of communities serves as an initialization step
  • Pass to iterative algorithm
  • User tracks progress and has power to dynamically set stopping criterion

How would it work?

Integrated Algorithm

Dynamic Temporal Graph Visualization

  • View a dynamic graph across time – how the edges change between nodes
  • Detect time-dependent communities
  • Add optional node labels to track progress

Questions?

Thank you!